In this chapter we'll be looking at the functionality that Biopython provides for reading and writing files. This is all handled by the Bio.SeqIO
module so let's start by importing it:
from Bio import SeqIO
For this section you need to download some files which we will be reading in. You can either download them by hand from using these links: V01408.1.fasta, ls_orchid_short.gbk and ls_orchid.gbk or run the following Python code:
import urllib
for f in ["V01408.1.fasta", "ls_orchid_short.fasta", "ls_orchid.gbk"]:
urllib.request.urlretrieve(f"https://milliams.com/courses/biopython/{f}", f)
SeqRecord
objects¶If you have a FASTA file with a single sequence in it, the simplest way to read it in is with SeqIO.read
. This takes two arguments, the name of the file you want to open, and the format that file is in.
tm = SeqIO.read("V01408.1.fasta", "fasta")
print(tm)
The read
functions returns an object called a SeqRecord
. These are like the Seq
s that we saw before but have additional information associated with them. A full list of these are:
.seq
Seq
object as we saw in the last chapter..id
.name
.description
.letter_annotations
.annotations
.features
SeqFeature
objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence)..dbxrefs
So you can get at the Seq
object fom inside the record with:
tm.seq
Have a look at the value of some of these attributes of the tm
object. Which are available and which are missing or empty?
It is very common to have files which contain multiple sequences. Biopython provides an interface to read these in, one after another which is the SeqIO.parse
function. This provides you with an object which you can loop over with a for
loop:
for record in SeqIO.parse("ls_orchid_short.fasta", "fasta"):
print(record.description)
Each time around the loop, you are given a SeqRecord
object which will work in exactly the same way as in the previous section, so getting the .description
works fine.
Load in the ls_orchid_short.fasta
file and print out the length of each sequence, along with its description.
The type of the object returned by SeqIO.parse
is known as a generator. One of the features of a generator in Python is that you can only loop over them once before they are exhausted. For example you can loop over them once without issue:
records = SeqIO.parse("ls_orchid_short.fasta", "fasta")
print([r.id for r in records])
But if you try to use the same object again, you will see that you get nothing back:
print([r.id for r in records])
The reason for this is that it allows you to access, one at a time, a very long list of sequences, potentially more than can fit in memory at once since it only loads them one at a time.
If you know you have a short set of sequences (as we have in this tutorial) then you can load them all into a Python list and access them however you wish:
records = list(SeqIO.parse("ls_orchid_short.fasta", "fasta"))
print([r.id for r in records])
print([r.id for r in records])
As well as FASTA files, Biopython can read GenBank files. All you need to do is specify the filetype when calling the SeqIO.parse
function. If you pass "genbank"
(or "gb"
) as the second argument then it will read it as a GenBankfile:
record_iterator = SeqIO.parse("ls_orchid.gbk", "genbank")
If you are loading a file with multiple sequences in, you can grab just the first one with Python's next
function. This gives you whatever would be available the first time around the loop:
first_record = next(record_iterator)
GenBank fiiles usually contain a lot more information than a FASTA so more of the fields of the SeqRecord
will be filled in. This means that we can, for example, see the annotations that the sequecne has:
first_record.annotations
Take a look at the record and see what other SeqRecord
attributes are filled in which were missing from the FASTA file we loaded earlier.
We’ve talked about using Bio.SeqIO.parse
for sequence input (reading files), and now we’ll look at Bio.SeqIO.write
which is for sequence output (writing files). This is a function taking three arguments: some SeqRecord
objects, a handle or filename to write to, and a sequence format.
Here is an example, where we start by creating a few SeqRecord
objects the hard way (by hand, rather than by loading them from a file):
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
rec1 = SeqRecord(
Seq(
"MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
"SSAC",
),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]",
)
rec2 = SeqRecord(
Seq(
"YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
"DMVVVEIPKLGKEAAVKAIKEWGQ",
),
id="gi|13919613|gb|AAK33142.1|",
description="chalcone synthase [Fragaria vesca subsp. bracteata]",
)
my_records = [rec1, rec2]
Now we have a list of SeqRecord
objects, we’ll write them to a FASTA format file:
SeqIO.write(my_records, "my_example.faa", "fasta")
The 2
that gets returned tells you how many records were written.
Create a SeqRecord
by hand and write it to a .fasta
file. Then read that file in and check that the details match.